CS432 - Fall 2001

CS432 - Fall 2001
Assignment 5 - B+-Tree
Deadline: November 20, 23:59pm

Check this page and the newsgroup frequently for updates. The extra credit part of A3 will be graded in a week.

FAQ

· Question: It seems that clock() returns only the processor time, not the overall time.
Answer: Use time() instead to get accurate timings.

· Question: In reporting the statistics, ie. running time and number of page misses, do we have to vary the number of tuples in each relation besides varying the buffer pool size?
Answer: Use a large number of tuples to get a stable running time and repeat your experiments a few times as stated below in under "statistics collection". You do not have to vary the number of tuples.

· Question: When we do the join, should we include both of the join attributes or just one copy?
Answer: Use the function MakeNewRecord() to make the new record.

Introduction
For this assignment you are expected to implement three join algorithms and evaluate their relative performance.

Background
Let us quickly review some of the modules you will need for this assignment. You should already be familiar with them from the the previous assignments.

In Minibase, a relation is implemented as a heap file, which is a collection of records. Records can be inserted or deleted from a heap file, and each record is uniquely identified by a record id. A scan is the interface used to access records in a heap file, one by one.

An index provides fast access to records in the heap file, and currently Minibase only supports B+-indices. Each entry in the index is a (key, record id) pair. Entries can be inserted or deleted from an index. An index search scan provides interface for accessing the records in the index.

Four classes are provided for you : (i.e. you only need to call methods in these classses)

HeapFile - implements a heap file
Scan - scan interface to HeapFile
BTreeFile - implements B+-tree index
BTreeFileScan - implements scan interface to B+-tree

Refer to the reference section for the details of these classes and how to use them.

Statistics collection
Augment the buffer manager class given to you to collect statistics. Specifically, you should be able to tell how many PinPage requests have been made and how many of those miss. It is suggested that you write two functions to reset and report the statistics.

You can use the time() function (which returns the current time) to determine out the running time of your join algorithm. You will need to include time.h. Refer to the C++ help for more details on time().

You should let your algorithm run for a few times (the longer the better) and report the average running time. Avoid running other CPU or I/O intensive processes (Word, Netscape, etc.) while collecting statistics since clock() reports actual time rather than CPU time. Print out average running times and the number of misses for every join algorithm you implement.

Tuple-at-a-time nested loop join.
Start with this one, it is the simplest to implement.

Block nested loop join
The algorithm for this join method can be found in your textbook. Since Minibase does not provide page-by-page access into a relation stored in a heap file, you will simulate a "block" with an array storing the records. The join function will take as a parameter, an integer B (block size = array size) of the outer relation. You are not required to implement the double buffering techniques or use hash tables.

The pseudocode for this join is:

For each block B in R

  For each tuple s in S

    For each tuple r in B

      Match r with s

      if Match then

        Insert (r,s) into the result relation

Compare the performance of "Block nested loop" join for various block sizes.

Index nested loop join
Implement this algorithm based on the pseudo-code in the textbook.

First create an unclustered B+-tree index on the inner relation. Determine whether it is beneficial to build a B+-tree index for the purpose of performing a single join operation by comparing the cost of this join with that of other join methods.

Performance comparison

Compare the relative performance of the three join algorithms (tuple nested loops join, block nested loops join, index nested loop join). Collect the times taken for each algorithm to run, and the number of page misses.
Study the effect of the buffer pool size on the three algorithms by changing the buffer pool size (NUM_BUF_PAGES in main.cpp).
Submit documentation, tables and graphs of these statistics, together with analysis of the results.

Simplifications

You can assume that all records are fixed length.
You can assume that we only perform joins on integer fields.

Source code
The source code for the project is located in the directory \\goose\courses\cs432-fall01\a5\code. The directory contains:

join.cpp/.h - utility functions useful for writing join algorithms.
relation.cpp/.h- functions to create test relations.
<join-method>.cpp - each file contains a skeleton for implementing a particular join method.
main.cpp - main program.
spacemgr - subdirectory for space manager code (HeapFile class, DB class etc.)
btree - subdirectory for B+-Tree index code
bufmgr - subdirectory for buffer manager code
include - subdirectory of all .h files needed.

You should write your code in <join-method>.cpp and main.cpp. Investigate the code provided to understand how the test program works. Functions in join.cpp and relation.cpp will be useful for writing your join methods and for debugging. The function SortFile() is particularly useful as an example on how to use HeapFile, Scan, BTreeFile and BTreeFileScan. To start working on the project, copy the directory to your workspace, and double-click on a6.dsw.

Hints

Study the Scan class. Learn how to access records in a heapfile. Study the SortFile function in join.cpp.
Use small numbers of records (by changing the constants in join.h) for testing.
A rough estimate: each join algorithm should be around 100 lines of code.
Since this time we are not building a library for Minibase, you are free to change interfaces of the join functions provided.
You are also free to modify anything in the main function, but do not change the functions that create relations so that we can compare your results with our own.
The user interface will not affect your grade. Explain how your program should be executed in the documentation. For a user interface you can either:

produce a few executables, one for each join
provide a menu interface to select the join algorithm to run
run all of the join algorithm sequentially

Reference
Those in bold are classes and types that are particularly useful for this assignment. You don't need to know the rest to complete the assignment, but if you wish to study the code provided, they may be of help.

Relations used in this assignment
struct JoinSpec

class DB
class Page

union DataType
union KeyType

enum AttrType
enum NodeType

struct KeyDataType

macro MINIBASE_DB
macro MINIBASE_BM

Coding convention
You need to follow certain coding convention for all your assignments:

Names for all classes/method/type should start with a caps. Break multiple words with caps. For example AddFileEntry.
Names for all members/variables should start with small letters. Break multiple words with caps. For example numOfPages;
Names for all enum/macro should be all caps. Break multiple words with underscore. For example MINIBASE_DB.
If you create a new function, make sure the header contains the necessary comments. (PreCond, PostCond, etc).

Submission procedure

How to hand in: You are a group of two students. Drop only the following five files into your handin directory on goose (\\goose\courses\cs432-fall01\handinA5).

blockjoin.cpp
indexjoin.cpp
tuplejoin.cpp
main.cpp
documentation

So if students with netids "abc1" and "xyz2" form a group together, one of the two handin directories (for example the directory \\goose\courses\cs432-fal01\handinA5\abc1) will contain the five files for the group. Make sure that your program can be recompiled as needed. You can submit the executable file, but we probably won't look at it. Delete the empty handin directory of the other student.

Keep a copy of the project in your own account just in case.

Grading criteria
Your assignments will be graded according to the following criteria:

Correctness (70%): You will get full marks for a correct implementation. Partial credit will be to partially correct answers.
Coding style (10%): You are expected to write neat code. Code should be properly indented and commented. You should follow the coding conventions.
Documentation (20%): Your documentation should contains results of your experiments in table and graph format. You should include a brief analysis, explaining the results that you have submitted. (You should not just reproduce the formulas in the textbook.) Call your document a5.doc.